{
    "cells": [
        {
            "cell_type": "markdown",
            "id": "e541c8c8",
            "metadata": {
                "id": "e541c8c8"
            },
            "source": [
                "<center>\n",
                "    <p style=\"text-align:center\">\n",
                "        <img alt=\"phoenix logo\" src=\"https://storage.googleapis.com/arize-phoenix-assets/assets/phoenix-logo-light.svg\" width=\"200\"/>\n",
                "        <br>\n",
                "        <a href=\"https://docs.arize.com/phoenix/\">Docs</a>\n",
                "        |\n",
                "        <a href=\"https://github.com/Arize-ai/phoenix\">GitHub</a>\n",
                "        |\n",
                "        <a href=\"https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q\">Community</a>\n",
                "    </p>\n",
                "</center>"
            ]
        },
        {
            "cell_type": "markdown",
            "id": "1c2c8e6f",
            "metadata": {
                "id": "1c2c8e6f"
            },
            "source": [
                "# Trace-Level Evals for a Movie Recommendation Agent"
            ]
        },
        {
            "cell_type": "markdown",
            "id": "a68d55ab",
            "metadata": {
                "id": "a68d55ab"
            },
            "source": [
                "This notebook demonstrates how to run trace-level evaluations for a movie recommendation agent. By analyzing individual traces, each representing a single user request, you can see how well the system performs on a per-interaction basis. Trace-level evaluations are particularly valuable for identifying end-to-end successes and failures.\n",
                "\n",
                "In this notebook, you will:\n",
                "- Build and capture interactions (traces) from your movie recommendation agent\n",
                "- Evaluate each trace across key dimensions such as Recommendation Relevance and Tool Usage\n",
                "- Format the evaluation outputs to match Phoenix’s schema and log them to the platform\n",
                "- Learn a robust pipeline for assessing trace-level performance\n",
                "\n",
                "✅ You will need a free [Phoenix Cloud account](https://app.arize.com/auth/phoenix/login) and an OpenAI API key to run this notebook."
            ]
        },
        {
            "cell_type": "markdown",
            "id": "549464f8",
            "metadata": {
                "id": "549464f8"
            },
            "source": [
                "# Set Up Keys & Dependencies"
            ]
        },
        {
            "cell_type": "code",
            "id": "5b425a65",
            "metadata": {
                "collapsed": true,
                "id": "5b425a65"
            },
            "source": [
                "%pip install openinference-instrumentation-openai openinference-instrumentation-openai-agents openinference-instrumentation arize-phoenix arize-phoenix-otel nest_asyncio openai openai-agents"
            ],
            "outputs": [],
            "execution_count": null
        },
        {
            "cell_type": "code",
            "id": "15652859",
            "metadata": {
                "id": "15652859"
            },
            "source": [
                "import os\n",
                "from getpass import getpass\n",
                "\n",
                "import nest_asyncio\n",
                "\n",
                "nest_asyncio.apply()\n",
                "\n",
                "if not (phoenix_endpoint := os.getenv(\"PHOENIX_COLLECTOR_ENDPOINT\")):\n",
                "    phoenix_endpoint = getpass(\"🔑 Enter your Phoenix Collector Endpoint: \")\n",
                "os.environ[\"PHOENIX_COLLECTOR_ENDPOINT\"] = phoenix_endpoint\n",
                "\n",
                "\n",
                "if not (phoenix_api_key := os.getenv(\"PHOENIX_API_KEY\")):\n",
                "    phoenix_api_key = getpass(\"🔑 Enter your Phoenix API key: \")\n",
                "os.environ[\"PHOENIX_API_KEY\"] = phoenix_api_key\n",
                "\n",
                "if not (openai_api_key := os.getenv(\"OPENAI_API_KEY\")):\n",
                "    openai_api_key = getpass(\"🔑 Enter your OpenAI API key: \")\n",
                "os.environ[\"OPENAI_API_KEY\"] = openai_api_key"
            ],
            "outputs": [],
            "execution_count": null
        },
        {
            "cell_type": "markdown",
            "id": "f47e0638",
            "metadata": {
                "id": "f47e0638"
            },
            "source": [
                "# Configure Tracing"
            ]
        },
        {
            "metadata": {},
            "cell_type": "code",
            "source": [
                "from phoenix.otel import register\n",
                "\n",
                "# configure the Phoenix tracer\n",
                "tracer_provider = register(project_name=\"movie-rec-agent\", auto_instrument=True)"
            ],
            "id": "36dbc00a43257651",
            "outputs": [],
            "execution_count": null
        },
        {
            "cell_type": "markdown",
            "id": "0a3919bd",
            "metadata": {
                "id": "0a3919bd"
            },
            "source": [
                "# Build Movie Recommendation System"
            ]
        },
        {
            "cell_type": "markdown",
            "id": "7c81c9bb",
            "metadata": {
                "id": "7c81c9bb"
            },
            "source": [
                "First, we need to define the tools that our recommendation system will use. For this example, we will define 3 tools:\n",
                "1. Movie Selector: Based on the genre the user asks for, choose up to 5 recent movies available for streaming\n",
                "2. Reviewer: Find reviews for a movie. If given a list of movies, sort movies in order of highest to lowest ratings.\n",
                "3. Preview Summarizer: For each movie, return a 1-2 sentence description\n",
                "\n",
                "In the ideal flow, the user simply tells the system what type of movie they are looking for and gets back a list of options with descriptions and reviews."
            ]
        },
        {
            "cell_type": "markdown",
            "id": "c64844a7",
            "metadata": {
                "id": "c64844a7"
            },
            "source": [
                "Let's define the tools, then test our agent and view the traces in Phoenix."
            ]
        },
        {
            "cell_type": "code",
            "id": "f1731db8",
            "metadata": {
                "id": "f1731db8"
            },
            "source": [
                "import ast\n",
                "from typing import List, Union\n",
                "\n",
                "from agents import Agent, Runner, function_tool\n",
                "from openai import OpenAI\n",
                "from opentelemetry import trace\n",
                "\n",
                "tracer = trace.get_tracer(__name__)\n",
                "\n",
                "client = OpenAI()\n",
                "\n",
                "\n",
                "@function_tool\n",
                "def movie_selector_llm(genre: str) -> List[str]:\n",
                "    prompt = (\n",
                "        f\"List up to 5 recent popular streaming movies in the {genre} genre. \"\n",
                "        \"Provide only movie titles as a Python list of strings.\"\n",
                "    )\n",
                "    response = client.chat.completions.create(\n",
                "        model=\"gpt-4\",\n",
                "        messages=[{\"role\": \"user\", \"content\": prompt}],\n",
                "        temperature=0.7,\n",
                "        max_tokens=150,\n",
                "    )\n",
                "    content = response.choices[0].message.content\n",
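                "    try:\n",
                "        movie_list = ast.literal_eval(content)\n",
                "        if isinstance(movie_list, list):\n",
                "            return movie_list[:5]\n",
                "    except Exception:\n",
                "        pass\n",
                "    # Fallback: the reply was not a Python list, so treat each non-empty line as a title.\n",
                "    # (Previously a parsed non-list value fell through and the tool returned None.)\n",
                "    return [line.strip() for line in content.split(\"\\n\") if line.strip()][:5]\n",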
                "\n",
                "\n",
                "@function_tool\n",
                "def reviewer_llm(movies: Union[str, List[str]]) -> str:\n",
                "    if isinstance(movies, list):\n",
                "        movies_str = \", \".join(movies)\n",
                "        prompt = f\"Sort the following movies by rating from highest to lowest and provide a short review for each:\\n{movies_str}\"\n",
                "    else:\n",
                "        prompt = f\"Provide a short review and rating for the movie: {movies}\"\n",
                "    response = client.chat.completions.create(\n",
                "        model=\"gpt-4\",\n",
                "        messages=[{\"role\": \"user\", \"content\": prompt}],\n",
                "        temperature=0.7,\n",
                "        max_tokens=300,\n",
                "    )\n",
                "    return response.choices[0].message.content.strip()\n",
                "\n",
                "\n",
                "@function_tool\n",
                "def preview_summarizer_llm(movie: str) -> str:\n",
                "    prompt = f\"Write a 1-2 sentence summary describing the movie '{movie}'.\"\n",
                "    response = client.chat.completions.create(\n",
                "        model=\"gpt-4\",\n",
                "        messages=[{\"role\": \"user\", \"content\": prompt}],\n",
                "        temperature=0.7,\n",
                "        max_tokens=100,\n",
                "    )\n",
                "    return response.choices[0].message.content.strip()"
            ],
            "outputs": [],
            "execution_count": null
        },
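        {
            "cell_type": "markdown",
            "id": "b7d2a4e1",
            "metadata": {
                "id": "b7d2a4e1"
            },
            "source": [
                "The selector tool asks the model for a Python list and then parses the reply. The cell below is a minimal sketch of that parsing step on its own, so you can see how the fallback behaves without calling the API. `parse_movie_titles` and the sample replies are hypothetical, introduced only for illustration; they are not part of the agent."
            ]
        },
        {
            "cell_type": "code",
            "id": "c3f9e6a2",
            "metadata": {
                "id": "c3f9e6a2"
            },
            "source": [
                "import ast\n",
                "from typing import List\n",
                "\n",
                "\n",
                "def parse_movie_titles(content: str, limit: int = 5) -> List[str]:\n",
                "    \"\"\"Parse a model reply into at most `limit` titles (hypothetical helper).\"\"\"\n",
                "    try:\n",
                "        parsed = ast.literal_eval(content)\n",
                "        if isinstance(parsed, list):\n",
                "            return [str(title) for title in parsed[:limit]]\n",
                "    except (ValueError, SyntaxError):\n",
                "        pass\n",
                "    # Fallback: treat each non-empty line as one title\n",
                "    lines = [line.strip() for line in content.splitlines() if line.strip()]\n",
                "    return lines[:limit]\n",
                "\n",
                "\n",
                "# Made-up replies: one well-formed list and one plain-text answer\n",
                "print(parse_movie_titles(\"['Movie A', 'Movie B']\"))\n",
                "print(parse_movie_titles(\"Movie A\\nMovie B\\n\"))"
            ],
            "outputs": [],
            "execution_count": null
        },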
        {
            "cell_type": "code",
            "id": "8ee91369",
            "metadata": {
                "id": "8ee91369"
            },
            "source": [
                "agent = Agent(\n",
                "    name=\"MovieRecommendationAgentLLM\",\n",
                "    tools=[movie_selector_llm, reviewer_llm, preview_summarizer_llm],\n",
                "    instructions=(\n",
                "        \"You are a helpful movie recommendation assistant with access to three tools:\\n\"\n",
                "        \"1. MovieSelector: Given a genre, returns up to 5 recent streaming movies.\\n\"\n",
                "        \"2. Reviewer: Given one or more movie titles, returns reviews and sorts them by rating.\\n\"\n",
                "        \"3. PreviewSummarizer: Given a movie title, returns a 1-2 sentence summary.\\n\\n\"\n",
                "        \"Your goal is to provide a helpful, user-friendly response combining relevant information.\"\n",
                "    ),\n",
                ")\n",
                "\n",
                "\n",
                "async def main():\n",
                "    user_input = \"Which comedy movie should I watch?\"\n",
                "    result = await Runner.run(agent, user_input)\n",
                "    print(result.final_output)\n",
                "\n",
                "\n",
                "await main()"
            ],
            "outputs": [],
            "execution_count": null
        },
        {
            "cell_type": "markdown",
            "id": "6bef5b97",
            "metadata": {
                "id": "6bef5b97"
            },
            "source": [
                "Next, we’ll run the agent a few more times to generate additional traces. Feel free to adapt or customize the questions as you see fit."
            ]
        },
        {
            "cell_type": "code",
            "id": "5cf62bca",
            "metadata": {
                "id": "5cf62bca"
            },
            "source": [
                "questions = [\n",
                "    \"Which Batman movie should I watch?\",\n",
                "    \"I want to watch a good romcom\",\n",
                "    \"What is a very scary horror movie?\",\n",
                "    \"Name a feel-good holiday movie\",\n",
                "    \"Recommend a musical with great songs\",\n",
                "    \"Give me a classic drama from the 90s\",\n",
                "]\n",
                "\n",
                "for question in questions:\n",
                "    result = await Runner.run(agent, question)"
            ],
            "outputs": [],
            "execution_count": null
        },
        {
            "cell_type": "markdown",
            "id": "f819c8ee",
            "metadata": {
                "id": "f819c8ee"
            },
            "source": [
                "# Get Span Data from Phoenix"
            ]
        },
        {
            "cell_type": "markdown",
            "id": "0de0363e",
            "metadata": {
                "id": "0de0363e"
            },
            "source": [
                "Before running our evaluations, we first retrieve the span data from Phoenix. We then group the spans by trace and separate the input and output values."
            ]
        },
        {
            "metadata": {},
            "cell_type": "code",
            "source": [
                "from phoenix.client import AsyncClient\n",
                "\n",
                "px_client = AsyncClient()\n",
                "primary_df = await px_client.spans.get_spans_dataframe(project_identifier=\"movie-rec-agent\")"
            ],
            "id": "981913a80fa0a780",
            "outputs": [],
            "execution_count": null
        },
        {
            "cell_type": "code",
            "id": "91b78eee",
            "metadata": {
                "id": "91b78eee"
            },
            "source": [
                "import pandas as pd\n",
                "\n",
                "trace_df = primary_df.groupby(\"context.trace_id\").agg(\n",
                "    {\n",
                "        \"attributes.input.value\": \"first\",\n",
                "        \"attributes.output.value\": lambda x: \" \".join(x.dropna()),\n",
                "    }\n",
                ")\n",
                "\n",
                "trace_df.head()"
            ],
            "outputs": [],
            "execution_count": null
        },
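        {
            "cell_type": "markdown",
            "id": "d4a8f0b3",
            "metadata": {
                "id": "d4a8f0b3"
            },
            "source": [
                "To make the grouping step concrete, here is a small sketch that applies the same aggregation to a hand-made table. The rows in `toy_spans` are invented for illustration; only the column names are taken from the cell above. Note that pandas' `first` keeps the first non-null input in row order, while the lambda joins every non-null output in the trace into one string."
            ]
        },
        {
            "cell_type": "code",
            "id": "e5b1c7d4",
            "metadata": {
                "id": "e5b1c7d4"
            },
            "source": [
                "import pandas as pd\n",
                "\n",
                "# Toy span rows (made-up values) with the same column names as the export above\n",
                "toy_spans = pd.DataFrame(\n",
                "    {\n",
                "        \"context.trace_id\": [\"t1\", \"t1\", \"t2\"],\n",
                "        \"attributes.input.value\": [\"question 1\", \"tool input\", \"question 2\"],\n",
                "        \"attributes.output.value\": [\"tool output\", None, \"final answer\"],\n",
                "    }\n",
                ")\n",
                "\n",
                "# One row per trace: first input, all non-null outputs joined with spaces\n",
                "toy_trace_df = toy_spans.groupby(\"context.trace_id\").agg(\n",
                "    {\n",
                "        \"attributes.input.value\": \"first\",\n",
                "        \"attributes.output.value\": lambda x: \" \".join(x.dropna()),\n",
                "    }\n",
                ")\n",
                "\n",
                "toy_trace_df"
            ],
            "outputs": [],
            "execution_count": null
        },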
        {
            "cell_type": "markdown",
            "id": "d014913f",
            "metadata": {
                "id": "d014913f"
            },
            "source": [
                "# Define and Run Evaluators"
            ]
        },
        {
            "cell_type": "markdown",
            "id": "a426e4ca",
            "metadata": {
                "id": "a426e4ca"
            },
            "source": [
                "In this tutorial, we will evaluate two aspects: tool usage and relevance. You can add any additional evaluation templates you like. We will then run the evaluations using an LLM as the judge."
            ]
        },
        {
            "cell_type": "code",
            "id": "d0f1768d",
            "metadata": {
                "id": "d0f1768d"
            },
            "source": [
                "TOOL_CALLING_ORDER = \"\"\"\n",
                "You are evaluating the correctness of the tool calling order in an LLM application's trace.\n",
                "\n",
                "You will be given:\n",
                "1. The user input that initiated the trace\n",
                "2. The full trace output, including the sequence of tool calls made by the agent\n",
                "\n",
                "##\n",
                "User Input:\n",
                "{attributes.input.value}\n",
                "\n",
                "Trace Output:\n",
                "{attributes.output.value}\n",
                "##\n",
                "\n",
                "Respond with exactly one word: `correct` or `incorrect`.\n",
                "1. `correct` →\n",
                "- The tool calls occur in the appropriate order to fulfill the user's request logically and effectively.\n",
                "- A proper answer involves calls to reviews, summaries, and recommendations where relevant.\n",
                "2. `incorrect` → The tool calls are out of order, missing, or do not follow a coherent sequence for the given input.\n",
                "\"\"\""
            ],
            "outputs": [],
            "execution_count": null
        },
        {
            "cell_type": "code",
            "id": "131a5e7f",
            "metadata": {
                "id": "131a5e7f"
            },
            "source": [
                "RECOMMENDATION_RELEVANCE = \"\"\"\n",
                "You are evaluating the relevance of movie recommendations provided by an LLM application.\n",
                "\n",
                "You will be given:\n",
                "1. The user input that initiated the trace\n",
                "2. The list of movie recommendations output by the system\n",
                "\n",
                "##\n",
                "User Input:\n",
                "{attributes.input.value}\n",
                "\n",
                "Recommendations:\n",
                "{attributes.output.value}\n",
                "##\n",
                "\n",
                "Respond with exactly one word: `correct` or `incorrect`.\n",
                "1. `correct` →\n",
                "- All recommended movies match the requested genre or criteria in the user input.\n",
                "- The recommendations should be relevant to the user's request and shouldn't be repetitive.\n",
                "2. `incorrect` → One or more recommendations do not match the requested genre or criteria.\n",
                "\"\"\""
            ],
            "outputs": [],
            "execution_count": null
        },
        {
            "cell_type": "code",
            "id": "ce8d7f1b",
            "metadata": {
                "id": "ce8d7f1b"
            },
            "source": [
                "import os\n",
                "\n",
                "import nest_asyncio\n",
                "\n",
                "from phoenix.evals import OpenAIModel, llm_classify\n",
                "\n",
                "nest_asyncio.apply()\n",
                "\n",
                "model = OpenAIModel(\n",
                "    api_key=os.environ[\"OPENAI_API_KEY\"],\n",
                "    model=\"gpt-4o-mini\",\n",
                "    temperature=0.0,\n",
                ")\n",
                "\n",
                "rails = [\"correct\", \"incorrect\"]\n",
                "\n",
                "tool_eval_results = llm_classify(\n",
                "    dataframe=trace_df,\n",
                "    template=TOOL_CALLING_ORDER,\n",
                "    model=model,\n",
                "    rails=rails,\n",
                "    provide_explanation=True,\n",
                "    verbose=False,\n",
                ")\n",
                "\n",
                "tool_eval_results"
            ],
            "outputs": [],
            "execution_count": null
        },
        {
            "cell_type": "code",
            "id": "efd9620e",
            "metadata": {
                "id": "efd9620e"
            },
            "source": [
                "relevance_eval_results = llm_classify(\n",
                "    dataframe=trace_df,\n",
                "    template=RECOMMENDATION_RELEVANCE,\n",
                "    model=model,\n",
                "    rails=rails,\n",
                "    provide_explanation=True,\n",
                "    verbose=False,\n",
                ")\n",
                "\n",
                "relevance_eval_results"
            ],
            "outputs": [],
            "execution_count": null
        },
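        {
            "cell_type": "markdown",
            "id": "f6c2d8e5",
            "metadata": {
                "id": "f6c2d8e5"
            },
            "source": [
                "Each evaluation returns one row per trace with a `label` and an `explanation`. A quick way to read the results is to look at the share of each label. The sketch below does this on a made-up table, `toy_eval_results`, so it runs without any API calls; you could apply the same `value_counts` call to the real result DataFrames."
            ]
        },
        {
            "cell_type": "code",
            "id": "a7d3e9f6",
            "metadata": {
                "id": "a7d3e9f6"
            },
            "source": [
                "import pandas as pd\n",
                "\n",
                "# Made-up evaluation output with the same two columns used later in this notebook\n",
                "toy_eval_results = pd.DataFrame(\n",
                "    {\n",
                "        \"label\": [\"correct\", \"incorrect\", \"correct\", \"correct\"],\n",
                "        \"explanation\": [\"ok\", \"tool missing\", \"ok\", \"ok\"],\n",
                "    }\n",
                ")\n",
                "\n",
                "# Fraction of traces per label\n",
                "label_share = toy_eval_results[\"label\"].value_counts(normalize=True)\n",
                "print(label_share)"
            ],
            "outputs": [],
            "execution_count": null
        },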
        {
            "cell_type": "markdown",
            "id": "4613e975",
            "metadata": {
                "id": "4613e975"
            },
            "source": [
                "# Log Results Back to Phoenix"
            ]
        },
        {
            "cell_type": "markdown",
            "id": "afd4401d",
            "metadata": {
                "id": "afd4401d"
            },
            "source": [
                "The final step is to log our results back to Phoenix. After running the cell below, you’ll be able to view your trace-level evaluations on the platform, complete with relevant labels, scores, and explanations."
            ]
        },
        {
            "cell_type": "code",
            "id": "aec87930",
            "metadata": {
                "id": "aec87930"
            },
            "source": [
                "root_spans = primary_df[primary_df[\"parent_id\"].isna()][[\"context.trace_id\", \"context.span_id\"]]\n",
                "\n",
                "tool_eval_results = tool_eval_results[[\"label\", \"explanation\"]]\n",
                "\n",
                "# Merge tool correctness eval results with trace_df\n",
                "tool_correctness_df = pd.merge(\n",
                "    trace_df, tool_eval_results, left_index=True, right_index=True, how=\"left\"\n",
                ")\n",
                "\n",
                "# Merge with root spans to get valid span IDs\n",
                "tool_correctness_df = pd.merge(\n",
                "    tool_correctness_df.reset_index(), root_spans, on=\"context.trace_id\", how=\"left\"\n",
                ").set_index(\"context.span_id\", drop=False)\n",
                "\n",
                "relevance_eval_results = relevance_eval_results[[\"label\", \"explanation\"]]\n",
                "\n",
                "# Merge relevance eval results with trace_df\n",
                "relevance_df = pd.merge(\n",
                "    trace_df, relevance_eval_results, left_index=True, right_index=True, how=\"left\"\n",
                ")\n",
                "\n",
                "# Merge with root spans to get valid span IDs\n",
                "relevance_df = pd.merge(\n",
                "    relevance_df.reset_index(), root_spans, on=\"context.trace_id\", how=\"left\"\n",
                ").set_index(\"context.span_id\", drop=False)\n",
                "\n",
                "\n",
                "# Log to Phoenix\n",
                "await px_client.spans.log_span_annotations_dataframe(\n",
                "    dataframe=tool_correctness_df,\n",
                "    annotation_name=\"Tool Correctness\",\n",
                "    annotator_kind=\"LLM\",\n",
                ")\n",
                "await px_client.spans.log_span_annotations_dataframe(\n",
                "    dataframe=relevance_df,\n",
                "    annotation_name=\"Recommendation Relevance\",\n",
                "    annotator_kind=\"LLM\",\n",
                ")"
            ],
            "outputs": [],
            "execution_count": null
        },
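        {
            "cell_type": "markdown",
            "id": "b8e4f0a7",
            "metadata": {
                "id": "b8e4f0a7"
            },
            "source": [
                "The merges in the logging cell attach each trace-level label to that trace's root span, that is, the span with no parent. The sketch below repeats the idea on invented data so the mapping is easy to inspect. All `toy_` names and IDs are hypothetical; only the column names come from the cell above."
            ]
        },
        {
            "cell_type": "code",
            "id": "c9f5a1b8",
            "metadata": {
                "id": "c9f5a1b8"
            },
            "source": [
                "import pandas as pd\n",
                "\n",
                "# Made-up trace-level labels, indexed by trace ID like the evaluation results\n",
                "toy_trace_labels = pd.DataFrame(\n",
                "    {\"label\": [\"correct\", \"incorrect\"]},\n",
                "    index=pd.Index([\"t1\", \"t2\"], name=\"context.trace_id\"),\n",
                ")\n",
                "\n",
                "# Made-up spans: s1 and s3 have no parent, so they are the root spans\n",
                "toy_all_spans = pd.DataFrame(\n",
                "    {\n",
                "        \"context.trace_id\": [\"t1\", \"t1\", \"t2\"],\n",
                "        \"context.span_id\": [\"s1\", \"s2\", \"s3\"],\n",
                "        \"parent_id\": [None, \"s1\", None],\n",
                "    }\n",
                ")\n",
                "toy_root_spans = toy_all_spans[toy_all_spans[\"parent_id\"].isna()][\n",
                "    [\"context.trace_id\", \"context.span_id\"]\n",
                "]\n",
                "\n",
                "# Join on trace ID, then index by the root span ID\n",
                "toy_annotations = pd.merge(\n",
                "    toy_trace_labels.reset_index(), toy_root_spans, on=\"context.trace_id\", how=\"left\"\n",
                ").set_index(\"context.span_id\", drop=False)\n",
                "\n",
                "toy_annotations"
            ],
            "outputs": [],
            "execution_count": null
        },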
        {
            "cell_type": "markdown",
            "id": "xVG-DsKlWb6c",
            "metadata": {
                "id": "xVG-DsKlWb6c"
            },
            "source": [
                "![Results](https://storage.googleapis.com/arize-phoenix-assets/assets/images/trace_level_evals_phoenix.png)"
            ]
        }
    ],
    "metadata": {
        "colab": {
            "provenance": []
        },
        "kernelspec": {
            "display_name": "Python 3",
            "name": "python3"
        },
        "language_info": {
            "name": "python"
        }
    },
    "nbformat": 4,
    "nbformat_minor": 5
}
